1 Abstract

2 Background

Adelaide’s population increased from 1.1 million to 1.3 million residents between 2006 and 2016, with 66 million more kilometers traveled on the road network during that time. Infrastructure Australia paints a dire picture of road congestion in Adelaide and expects it to keep worsening in the coming years in line with both an increasing population and an increasing reliance on public transport in comparison to cars. The report estimated the annualized cost of road congestion for Greater Adelaide at approximately $1.4 billion in 2016 and projected it to rise to $2.6 billion in 2031 (Infrastructure Australia, 2019).

With this backdrop in mind, the client, the South Australia Department for Infrastructure and Transport (DIT), has in its possession an untapped wealth of traffic data collected through Bluetooth probes, which count passing motor vehicles at a particular time and location, thereby producing a metric for road congestion.

This data will be examined in conjunction with publicly available, historical real time bus trip updates published through General Transit Feed Specification Realtime (GTFSR), which provides the arrival time for each stop on a bus’s trip. The analysis aims to identify the relationship between bus travel times and road congestion, and the robustness of bus travel times to that congestion, on road segments of interest.

3 Objectives

The aim of the proposed analysis is to investigate the extent of the relationship between bus travel times and road congestion - as measured by motor vehicle travel times - on identified road segments, where a strong relationship indicates a road segment where the bus travel times are less robust to congestion.

Initially, the bus performance metric to be used and applied was the average delay experienced by a bus trip on the segment of interest, as measured by a stop’s predicted arrival time versus the scheduled arrival time. However, this was later revised to measuring the bus travel time between the first and last stops of a segment, removing the possibility that we are measuring how accurately the schedule predicts and/or buffers for congestion. The road congestion metric used is the average travel time of the vehicles across the segment.

A proposal outlining the analysis, the objectives, and the methodology was created and sent to the client. This was followed by a discussion with the client to provide more information regarding the analysis and to clarify any points raised. Ultimately, an agreement was reached for the analysis to fulfill the following objectives:

  1. Detailed travel time or congestion analysis comparing public transport response to road traffic on selected sections of road over a given period of time

  2. Repeatable methodology, code, functions, and visuals that produce detailed analysis on other segments of interest

In fulfilling the first objective, the segments of road analysed are South Road and Marion Road in Adelaide. This report uses the former to illustrate the methodology, while the latter is used for comparison. The period of time chosen is March 2022.

Regarding the second objective, the methodology and the code aim to require as little manual input and editing as possible when applied to different road segments.

The analysis undertaken in this report will form the basis of future analysis into:

4 Data Sources, Description, and Wrangling

Three main data sources are used: DIT Addinsight, GTFS, and GTFSR. These data sources and their associated sub-sources will be outlined below. All the data is stored in the cloud using Amazon Web Services (AWS) and is accessed through Athena which uses regular SQL syntax.

The data cleaning and wrangling are discussed in detail, as they were the most labor-intensive part of the analysis.

As mentioned above, the methodology will be illustrated on South Road, which is one of Adelaide’s most important and major roads, and regularly suffers from congestion (Infrastructure Australia, 2019).

Figure 4.1: South Rd on map. Source: Google Maps

4.1 General Transit Feed Specification (GTFS)

This is a common format developed by Google and used by public transport agencies around the world and contains static or scheduled information about public transport services such as routes, stops, schedule and geographic transit information. For the purposes of this analysis, only the bus routes and bus stops datasets will be used.

4.1.1 Routes

These are the bus routes that go through South Road. They were identified by overlaying all the network routes on a map in Tableau, then highlighting the routes on South Road and exporting them to a list. The dataset simply contains the unique route_id values on the segment.

4.1.2 Stops

The list of bus stops on the segment was identified in the same way as the routes. This produces a list of the stop_id values on the segment.

A dataset containing all the stops in the bus network, together with information relating to each stop, is then filtered to only the stops present on the segment. For the purposes of this report, the file containing all the stops on the network was pre-filtered to the stops on the segment to accommodate GitHub file size limits. However, the code and methodology contained here apply as if the complete dataset were used and filtered through the code.

Table 4.1: Stops data description

| Variable | Description |
|---|---|
| stop_id | Unique stop identifier |
| stop_name | Name of the location. Uses a name that people will understand |
| stop_desc | Address of the stop |
| stop_lat | Latitude of the stop |
| stop_lon | Longitude of the stop |
| direction | Road direction of the stop |

The direction variable is manually created. In this case, if the stop is on the east side of South Road, then it is southbound (SB) away from the city; if the stop is on the west side of South Road, then it is northbound (NB) towards the city.

The bus stops will be plotted on a map to confirm they are all, in fact, on South Road.

Figure 4.2: Bus stops on South Road

4.2 General Transit Feed Specification Realtime (GTFSR)

Unlike GTFS which provides static information, GTFSR provides real time information consisting of two types. The first type is a trip’s real time updates regarding a bus stop’s expected arrival times and delays. The second type is a real time update of a bus’s geographic position and speed at a specific point in time. This analysis uses the former only.

4.2.1 Trip Updates

Once the bus routes that go through the segment were identified as outlined above, the real time updates for all the trips on those routes in March 2022 were retrieved from the AWS database using Athena. This dataset is used to derive the bus travel time through the segment, which is the first element in the relationship being assessed in this analysis, the other being the vehicle travel time as a measure of congestion. The SQL query to retrieve the updates can be found in appendix 6.1.

First, the unedited data will be described.

Table 4.2: Unedited updates data description

| Variable | Description |
|---|---|
| route_id | Unique route identifier |
| start_date | Start date of the trip |
| vehicle_id | Unique vehicle identifier |
| timestamp | Timestamp of the real time update |
| trip_id | Unique trip identifier |
| stop_sequence | Order of stops for a particular trip |
| stop_id | Unique stop identifier |
| delay | The current schedule deviation for the trip. The delay (in seconds) can be positive (meaning that the vehicle is late) or negative (meaning that the vehicle is ahead of schedule) |
| arrival_time | Predicted arrival time for a stop on a particular trip |

It is important to note the following:

  • One route_id can have many trip_ids

  • One trip_id occurs at most once a day, but the same trip_id can occur on multiple days

As a bus trip is occurring, each update timestamp carries a refreshed real time prediction of the arrival times at the remaining stops on the trip.

Cleaning and wrangling this dataset proved to be the most challenging and time consuming section of this analysis, with many methodologies, cleaning iterations, and code trialed to arrive at the optimal treatment. This is due to the complex relationships between the observations in the dataset, and the variety of errors and inconsistencies encountered.

The following preliminary adjustments were done:

  1. As each stop on a given trip can have multiple arrival time predictions, one per update timestamp prior to reaching that stop, the SQL query ensures that each stop only has the predicted arrival time corresponding to the latest timestamp, given that the later the prediction, the more accurate it is.

  2. As a trip can begin and end outside the bounds of the segment, the updates were constrained only to those stops within the segment, in either direction.

  3. Weekends and holidays were removed as we are interested in the relationships during working days.
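The three adjustments above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the field names mirror table 4.2, but the sample rows, `segment_stops`, `holidays`, and the function name `preliminary_filter` are hypothetical, and in the report the first adjustment is carried out in the SQL query rather than in Python.

```python
from datetime import date, datetime

# Hypothetical sample rows; field names follow table 4.2.
updates = [
    {"trip_id": "T1", "start_date": date(2022, 3, 1), "stop_id": "S1",
     "timestamp": datetime(2022, 3, 1, 7, 50), "arrival_time": datetime(2022, 3, 1, 8, 0)},
    {"trip_id": "T1", "start_date": date(2022, 3, 1), "stop_id": "S1",
     "timestamp": datetime(2022, 3, 1, 7, 58), "arrival_time": datetime(2022, 3, 1, 8, 2)},
    {"trip_id": "T1", "start_date": date(2022, 3, 1), "stop_id": "S9",  # stop outside the segment
     "timestamp": datetime(2022, 3, 1, 7, 58), "arrival_time": datetime(2022, 3, 1, 8, 9)},
    {"trip_id": "T2", "start_date": date(2022, 3, 5), "stop_id": "S1",  # a Saturday
     "timestamp": datetime(2022, 3, 5, 9, 0), "arrival_time": datetime(2022, 3, 5, 9, 5)},
]
segment_stops = {"S1", "S2"}  # illustrative stop_id values, not the real South Road stops
holidays = set()              # would be filled from the holidays dataset


def preliminary_filter(rows, segment_stops, holidays):
    """Apply the three preliminary adjustments to a list of update rows."""
    latest = {}
    for row in rows:
        day = row["start_date"]
        # Adjustment 2: keep only stops that lie within the segment.
        if row["stop_id"] not in segment_stops:
            continue
        # Adjustment 3: drop weekends (Saturday=5, Sunday=6) and holidays.
        if day.weekday() >= 5 or day in holidays:
            continue
        # Adjustment 1: keep the prediction with the latest timestamp per stop.
        key = (row["trip_id"], day, row["stop_id"])
        if key not in latest or row["timestamp"] > latest[key]["timestamp"]:
            latest[key] = row
    return list(latest.values())


cleaned = preliminary_filter(updates, segment_stops, holidays)
```

In this toy input only one row survives: the later of the two predictions for stop S1 on trip T1.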

A new variable to_stop_time was created. This variable measures the time taken to reach each stop from the prior stop in seconds, within each trip. The variable was created to facilitate a potential deeper understanding of the data, to highlight any errors, and for potential utilities in the future such as drilling down to examine the pattern on a stop-basis.
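As a rough sketch of how to_stop_time could be derived for a single trip, the function below orders the stops by stop_sequence and takes the difference between consecutive predicted arrival times. The function name `add_to_stop_time` and the sample values are hypothetical; the report's own implementation may differ.

```python
from datetime import datetime


def add_to_stop_time(stops):
    """Return the stops of ONE trip, ordered by stop_sequence, with to_stop_time in seconds.

    The first stop has no prior stop, so its to_stop_time is None.
    """
    ordered = sorted(stops, key=lambda s: s["stop_sequence"])
    result = []
    previous = None
    for stop in ordered:
        if previous is None:
            gap = None
        else:
            gap = (stop["arrival_time"] - previous["arrival_time"]).total_seconds()
        result.append({**stop, "to_stop_time": gap})
        previous = stop
    return result


# Hypothetical trip in which the third stop is predicted BEFORE the second one.
trip = [
    {"stop_sequence": 1, "arrival_time": datetime(2022, 3, 1, 8, 0)},
    {"stop_sequence": 2, "arrival_time": datetime(2022, 3, 1, 8, 2)},
    {"stop_sequence": 3, "arrival_time": datetime(2022, 3, 1, 8, 1)},
]
with_gaps = add_to_stop_time(trip)
gaps = [s["to_stop_time"] for s in with_gaps]
```

A negative value in `gaps`, as for the third stop here, is the kind of inconsistency this variable is meant to expose.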

Through this variable, a range of errors were discovered that needed to be amended. This is how the data appears before any remedial actions are taken.

Figure 4.3: Unedited to-stop times contain negative values

Figure 4.3 shows that to_stop_time contains negative values to the left of the red line. This is a clear error, as the time taken to reach a stop cannot be negative. Additionally, clusters of very high delay values appear above 4,000 seconds, which is over an hour.

In total, there were eight types of errors identified in the data. The list of errors, an example of each error, and the code to rectify the errors can be found in appendix 6.2.

Great effort was put into identifying each type of error and remedying it in a way that neither produces further errors nor removes large amounts of data. Identifying the correct order in which to tackle the error types was also essential, and formulating the code to fix each error required several rounds of trial and error. This was all done to remove the errors as surgically as possible, both to minimize data loss and because of the sensitive nature of the relationships between the stops on each trip.

The percentage of error entries located and fixed in the data was 3.82%. The cleaned data now appears as follows:

Figure 4.4: Cleaned to-stop times do not contain negative values

With the data now cleaned, two additional variables were created called first_stop and last_stop, which identify the first and last stops of each trip within the segment. The total time per trip can now be derived by calculating the time between the first stop and last stop of the trip within the segment. The arrival time of the first stop and last stop on the segment will be regarded as the start and end time, respectively, of the trip. The distribution of the trip times per direction is shown below. The two most occurring first-last stops pair per direction will be used.

Figure 4.5: Different stops pairs in the same direction have different trip times

As figure 4.5 shows, different first-last stop pairs in the same direction have different travel times. This means that routes and trips can have different travel times solely because of their respective first and last stops on the segment, which makes their travel times incomparable as they cover different distances. Therefore, only trips with the same pair of first and last stops within the segment are kept, and the remaining trips are discarded. There can be only one pair of first and last stops per direction, so that the distance is constant for all the trips and the travel times are comparable.

This pair of stops is identified as the most frequently occurring pair per direction, and only trips with this pair of first and last stops are kept in the data. The stop pair per direction can be seen in the map below.

Figure 4.6: Most occurring pair of bus stops per direction
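The derivation of the segment travel time and the selection of the most frequent first-last stop pair per direction can be sketched as follows. The names `summarise_trip` and `keep_most_common_pair`, the `travel_time` field, and the sample trips are hypothetical and serve only to illustrate the logic described above.

```python
from collections import Counter
from datetime import datetime


def summarise_trip(stops):
    """Collapse the stops of one trip into first/last stop and segment travel time (seconds)."""
    ordered = sorted(stops, key=lambda s: s["stop_sequence"])
    first, last = ordered[0], ordered[-1]
    return {
        "first_stop": first["stop_id"],
        "last_stop": last["stop_id"],
        "start_time": first["arrival_time"],
        "travel_time": (last["arrival_time"] - first["arrival_time"]).total_seconds(),
    }


def keep_most_common_pair(trips):
    """Keep only trips whose (first_stop, last_stop) pair is the most frequent in their direction."""
    counts = {}
    for t in trips:
        pair = (t["first_stop"], t["last_stop"])
        counts.setdefault(t["direction"], Counter())[pair] += 1
    best = {d: c.most_common(1)[0][0] for d, c in counts.items()}
    return [t for t in trips if (t["first_stop"], t["last_stop"]) == best[t["direction"]]]


# Hypothetical example data.
summary = summarise_trip([
    {"stop_sequence": 4, "stop_id": "S1", "arrival_time": datetime(2022, 3, 1, 8, 0)},
    {"stop_sequence": 5, "stop_id": "S2", "arrival_time": datetime(2022, 3, 1, 8, 3)},
    {"stop_sequence": 6, "stop_id": "S5", "arrival_time": datetime(2022, 3, 1, 8, 9)},
])
trips = [
    {"trip_id": "A", "direction": "NB", "first_stop": "S1", "last_stop": "S5"},
    {"trip_id": "B", "direction": "NB", "first_stop": "S1", "last_stop": "S5"},
    {"trip_id": "C", "direction": "NB", "first_stop": "S2", "last_stop": "S5"},
    {"trip_id": "D", "direction": "SB", "first_stop": "S6", "last_stop": "S9"},
]
kept = keep_most_common_pair(trips)
```

If two pairs are tied within a direction, `most_common` returns the one encountered first, so a real implementation may want an explicit tie-breaking rule.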

The distribution of the trip times per direction is shown below.

Figure 4.7: Excessively large trip times exist, especially southbound

As figure 4.7 shows, excessive trip times occur, especially southbound. It is difficult to determine whether these are errors or genuine trip times without further information. A variable called delay_diff is therefore created, which calculates the size of the difference between the delay of the first stop and the delay of the last stop per trip. Excessive values of this variable indicate that the large travel time is due to an error, as one of the two stops has an artificially large delay or early arrival. A plot of delay_diff vs travel_time is shown below.

Figure 4.8: Size of difference between first and last stop delays. Southbound is more problematic

Based on figure 4.8, trips with a delay_diff greater than 600 seconds (10 minutes) were removed as they were most likely errors. The resulting data now appears as follows:

Figure 4.9: Excessively large trip times no longer exist

Figure 4.9 shows that travel times southbound away from the city are larger and more dispersed than the travel times northbound towards the city.
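The delay_diff filter used to obtain this cleaned data can be sketched as below. The field names `first_delay` and `last_delay` and the function name `drop_suspect_trips` are hypothetical; the 600 second threshold is the one stated above, and "size of the difference" is interpreted here as an absolute difference.

```python
def delay_diff(first_delay, last_delay):
    """Size (absolute value) of the difference between first-stop and last-stop delay, in seconds."""
    return abs(last_delay - first_delay)


def drop_suspect_trips(trips, threshold=600):
    """Keep trips whose delay_diff does not exceed the threshold (600 s = 10 minutes)."""
    return [
        t for t in trips
        if delay_diff(t["first_delay"], t["last_delay"]) <= threshold
    ]


# Hypothetical trips: B picks up 1,500 s of delay across the segment and is treated as an error.
trips = [
    {"trip_id": "A", "first_delay": 60, "last_delay": 240},
    {"trip_id": "B", "first_delay": -30, "last_delay": 1470},
    {"trip_id": "C", "first_delay": 500, "last_delay": -100},
]
kept = drop_suspect_trips(trips)
```

Trip C sits exactly at the threshold and is kept, since only values greater than 600 seconds are removed.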

Figure 4.10: Greater variation exists between the travel times of both directions in the evening

Figure 4.10 shows that travel times are in fact very similar during the morning rush hour, while in the evening, travel time southbound away from the city is longer as expected.

The data relating to only the first stop per trip was kept since the arrival time of the first stop will be considered as the trip start time used as the basis for aggregation later in the analysis.

The bus travel times are split into five minute time periods, with the arrival time of the first stop on the trip used as the basis for this split. For example, all bus trips that start between 2022-03-01 12:00:00 and 2022-03-01 12:05:00 are included in the same time frame. Since each time frame can contain multiple trips, the bus travel times in it are averaged into one bus travel time. This establishes a one-to-one relationship with the vehicle travel times, which are also in five minute intervals. The final dataset looks as follows:

Table 4.3: Aggregated bus trip travel times data description

| Variable | Description |
|---|---|
| day | Date of measurement |
| time | Time of the day in hour:minute:seconds of the measurement |
| hour | The hour of the measurement |
| rush | Whether the measurement occurs during rush hour. Morning rush hour occurs between 6am and 10am, evening rush hour occurs between 3pm and 7pm, neither otherwise |
| direction | The direction of travel |
| number_buses | The original number of trips during the five minute interval |
| bus_time | The average bus trip travel time across the segment during the five minute interval |
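A minimal sketch of this aggregation is shown below, producing the columns of table 4.3. It assumes half-open intervals (a trip starting at exactly 12:05:00 falls into the 12:05 slot, and 10am itself is outside the morning rush hour), which the report does not specify; the function names and the sample trips are hypothetical.

```python
from collections import defaultdict
from datetime import datetime


def floor_5min(ts):
    """Round a datetime down to the start of its five minute interval."""
    return ts.replace(minute=ts.minute - ts.minute % 5, second=0, microsecond=0)


def rush_label(ts):
    """Classify a datetime as morning rush (6am-10am), evening rush (3pm-7pm), or neither."""
    if 6 <= ts.hour < 10:
        return "morning"
    if 15 <= ts.hour < 19:
        return "evening"
    return "neither"


def aggregate(trips):
    """Average bus travel times per (five minute slot, direction)."""
    groups = defaultdict(list)
    for t in trips:
        groups[(floor_5min(t["start_time"]), t["direction"])].append(t["travel_time"])
    rows = []
    for (slot, direction), times in sorted(groups.items()):
        rows.append({
            "day": slot.date(),
            "time": slot.time(),
            "hour": slot.hour,
            "rush": rush_label(slot),
            "direction": direction,
            "number_buses": len(times),
            "bus_time": sum(times) / len(times),
        })
    return rows


# Hypothetical trips: two northbound trips share the 12:00 slot.
trips = [
    {"start_time": datetime(2022, 3, 1, 12, 1, 30), "direction": "NB", "travel_time": 300},
    {"start_time": datetime(2022, 3, 1, 12, 4, 59), "direction": "NB", "travel_time": 360},
    {"start_time": datetime(2022, 3, 1, 17, 7, 0), "direction": "SB", "travel_time": 420},
]
rows = aggregate(trips)
```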

4.3 DIT Addinsight

This is traffic information collected by DIT using Bluetooth devices that tag a Bluetooth-equipped vehicle when it comes into range. The location of a Bluetooth device is called a site, and a link is a segment of road between two sites, an origin site and a destination site. This allows for the calculation of metrics such as the time taken to travel through the link, among others.

The DIT Addinsight database is very large and contains many tables, each recording its own set of information, with foreign keys connecting most tables. This analysis uses only a subset of the tables in the database, presented below.

4.3.1 Holidays

This dataset contains holiday dates, which are the only variable used.

5 Analysis

5.1 Travel Times Comparison

The travel times from both sets will be compared against one another. This is done to gain a general understanding of the relationship as well as to validate the datasets, as we would expect to observe a similar pattern between both travel times. The comparison will be done through a series of graphs.

Figure 5.1: Vehicles are faster in both directions. Distributions of both types resemble each other

Figure 5.1 shows that for northbound travel towards the city, vehicle travel time largely remains the same during both periods of rush hour, while bus travel time actually increases in the evening, a surprising result. For southbound travel away from the city, both travel times increase in the evening as expected. Buses are generally slower than vehicles, and both types exhibit similar patterns overall.

Figure 5.2: Northbound bus travel times are slower than vehicle travel times. Both exhibit similar patterns

Figure 5.3: Southbound bus travel times are slower than vehicle travel times. Both exhibit similar patterns

Figures 5.2 and 5.3 both show that the travel times of the two types generally follow a similar pattern. This indicates that both datasets are valid, as we would not expect to see very different patterns. Buses are almost always slower than vehicles, as they need to load and unload passengers at the bus stops along the road, and as heavy vehicles they accelerate more slowly. We also notice that towards the end of the day, in both directions, the travel times level off at a low value. This is likely due to less traffic being present on the road in the evening, leading to a faster traversal of the segment, with only constant factors such as the speed limit and traffic lights affecting the travel time.

A scatter plot of the travel times will be examined:

Figure 5.4: A positive relationship exists between both travel times for both directions during peak times

Figure 5.4 shows that a positive relationship exists between the travel times, more so for the southbound direction in the evening.

The correlations between the travel times are:

Table 5.1: Correlation between travel times per direction per rush hour

| Rush | Direction | Correlation |
|---|---|---|
| Morning | NB | 0.80 |
| Morning | SB | 0.49 |
| Evening | NB | 0.67 |
| Evening | SB | 0.87 |

Table 5.1 shows that bus and vehicle travel times are highly correlated in the morning northbound towards the city, and in the evening southbound away from the city.
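A per-group correlation of this kind can be sketched as follows. The report does not state which correlation coefficient was used, so the Pearson coefficient is an assumption here, as are the field names `vehicle_time` and `bus_time` for the paired five minute observations and the sample values.

```python
from collections import defaultdict
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)  # undefined (division by zero) if either series is constant


def correlation_by_group(rows):
    """Correlation between vehicle and bus travel times per (rush, direction)."""
    groups = defaultdict(lambda: ([], []))
    for r in rows:
        xs, ys = groups[(r["rush"], r["direction"])]
        xs.append(r["vehicle_time"])
        ys.append(r["bus_time"])
    return {key: pearson(xs, ys) for key, (xs, ys) in groups.items()}


# Hypothetical paired observations.
rows = [
    {"rush": "morning", "direction": "NB", "vehicle_time": 100, "bus_time": 150},
    {"rush": "morning", "direction": "NB", "vehicle_time": 200, "bus_time": 250},
    {"rush": "morning", "direction": "NB", "vehicle_time": 300, "bus_time": 350},
    {"rush": "evening", "direction": "SB", "vehicle_time": 100, "bus_time": 400},
    {"rush": "evening", "direction": "SB", "vehicle_time": 200, "bus_time": 300},
    {"rush": "evening", "direction": "SB", "vehicle_time": 300, "bus_time": 200},
]
correlations = correlation_by_group(rows)
```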

To gain a clearer picture of the patterns and relationship throughout an average day, the travel times within 30 minute aggregates of the same time frame will be averaged across all the days. For example, for each travel type, all the measures occurring between 12:00 and 12:30 across all the days will be averaged, then plotted.

Figure 5.5: Average travel time patterns by vehicle and direction. Peak times are highlighted

Figure 5.5 shows the average pattern of travel times across the day, by direction and type. The morning and evening rush hours have been highlighted as they are the parts of the day worthy of examination. For northbound travel towards the city, both rush hours display a similar level of travel time for both types, and the travel times in the rush hours are not much greater than at other times. This is an unexpected result, as travel times northbound towards the city would be expected to be higher in the morning rush hour. Southbound travel away from the city, however, follows expectations, as the travel time for both types increases dramatically in the evening rush hour as workers leave the city.
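The averaging behind such a daily profile can be sketched as below: each observation is assigned to the 30 minute slot of its time of day, and the slots are then averaged across all days. The function names, the field names, and the sample values are hypothetical.

```python
from collections import defaultdict
from datetime import time


def half_hour_slot(t):
    """Map a time of day to the start of its 30 minute slot."""
    return time(t.hour, 0 if t.minute < 30 else 30)


def average_day_profile(rows):
    """Average travel time per (direction, 30 minute slot) across all days."""
    groups = defaultdict(list)
    for r in rows:
        groups[(r["direction"], half_hour_slot(r["time"]))].append(r["travel_time"])
    return {key: sum(values) / len(values) for key, values in groups.items()}


# Hypothetical observations from different days that fall in the same slot.
rows = [
    {"direction": "NB", "time": time(12, 5), "travel_time": 300},
    {"direction": "NB", "time": time(12, 25), "travel_time": 340},
    {"direction": "NB", "time": time(12, 30), "travel_time": 400},
]
profile = average_day_profile(rows)
```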

5.2 Travel Times Variation Analysis

The goal of the analysis is to ascertain the extent of the relationship between the variations in the motor vehicle travel times and the variations in the bus trip travel times, where variation is measured relative to the travel times during the same time frame across the entire period. In other words, if the vehicle travel time varies by a certain level relative to the usual travel time during the same time frame, can we observe a reflection of this variation in the bus travel time? If so, by how much?

In order to assess the variation, the travel times will be standardized. The function standardiser is created which separately standardizes both the bus travel times and vehicle travel times according to the total data in the entire period based on either:

  • the five minute time frame. For example, a bus/vehicle travel time on 2022-03-01 at 7:00am would be standardized against all the other travel times that occur at 7:00am in the period

  • the hour of travel. For example, a bus/vehicle travel time that occurs on 2022-03-01 between 7am and 8am would be standardized against all the other travel times in the period that occur during that hour

  • the rush hour of travel. For example, a bus/vehicle travel time that occurs on 2022-03-01 during the morning rush hour would be standardized against all the other travel times in the period that occur during the morning rush hour

These options are provided to the function as an argument (“time”, “hour”, “rush”). As the time frame widens, more data is available for standardization. This is why, in addition to standardizing the data, the standardiser function also stores the total number of data points present in each time frame according to the method chosen. The function also removes observations more than three standard deviations away from the mean, as these are considered outliers that can affect the analysis.

Ideally, the travel times would be standardized according to the same five minute time frame across the entire period, as this would provide the highest accuracy. However, as the bus trips per five minute time frame were averaged into one five minute travel time, and we are analyzing only one month of data containing 21 working days, there are not enough travel times to accomplish this: there would be a maximum of 21 data points per five minute time frame available for standardization. Instead, the default standardization parameter is by hour, which provides a much greater number of data points.
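The behavior described for the standardiser function can be approximated with the sketch below. It is not the report's implementation: the name `standardise`, the output fields, grouping by direction in addition to the chosen time frame, and the use of the sample standard deviation are all assumptions made for illustration.

```python
from collections import defaultdict
from statistics import mean, stdev


def standardise(rows, value, by="hour", max_z=3.0):
    """Z-score `value` within each (direction, `by`) group and drop |z| > max_z.

    `by` is one of "time", "hour", or "rush". Each returned row also carries `n`,
    the number of data points in its group. Groups with fewer than two points or
    zero spread cannot be standardized and are skipped.
    """
    groups = defaultdict(list)
    for r in rows:
        groups[(r["direction"], r[by])].append(r[value])

    stats = {}
    for key, values in groups.items():
        if len(values) >= 2 and stdev(values) > 0:
            stats[key] = (mean(values), stdev(values), len(values))

    result = []
    for r in rows:
        key = (r["direction"], r[by])
        if key not in stats:
            continue
        mu, sd, n = stats[key]
        z = (r[value] - mu) / sd
        if abs(z) <= max_z:  # observations beyond three standard deviations are outliers
            result.append({**r, value + "_z": z, "n": n})
    return result


# Hypothetical bus travel times (seconds) observed in the same hour on different days.
rows = [{"direction": "NB", "hour": 7, "bus_time": v} for v in (10.0, 20.0, 30.0)]
standardised = standardise(rows, "bus_time", by="hour")
z_scores = [r["bus_time_z"] for r in standardised]
```

The same function would be applied separately to the bus and vehicle travel times, so that each series is expressed as a deviation from its own typical value for that time frame.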

With the bus and vehicle travel times standardized, we can now examine the relationship between the travel times with respect to variation. If the vehicle travel time deviates from the average, do we observe a similar deviation by the bus travel time?

Plots of the standardized travel times are shown below:

Figure 5.6: Northbound morning travel time variations are similar

Figure 5.7: Southbound evening travel time variations are similar

Figure 5.6 and figure 5.7 show that variations in vehicle travel times are in fact closely matched by variations in bus travel times, especially during the morning and northbound towards the city, where the variations also exist in greater magnitudes.

Figure 5.8: A positive relationship exists between both travel times for both directions during peak times

Figure 5.8 shows that the standardized travel times are particularly correlated in the evening southbound away from the city.

The correlations between the standardized travel times are:

Table 5.2: Correlation between standardized travel times per direction per rush hour

| Rush | Direction | Correlation |
|---|---|---|
| Morning | NB | 0.67 |
| Morning | SB | 0.22 |
| Evening | NB | 0.44 |
| Evening | SB | 0.74 |

Table 5.2 shows that strong correlation exists between the standardized travel times during the evening southbound away from the city.

The following plot increases the granularity of the correlation to a per hour figure:

Figure 5.9: High correlation is more uniform in the evening southbound

Figure 5.9 shows that high correlation exists in the morning northbound towards the city between 7am and 9am, while in the evening southbound away from the city, high correlation exists throughout the rush hour and peaks between 6pm and 7pm.

6 Appendix

6.1 Trip Updates SQL Query

6.2 Updates Errors

  1. All stops for the trip have very large, or very small, similar delays. This means that the entire trip is very delayed or very early, most likely because the information was entered retrospectively at a later time. Based on figure 4.3, the thresholds were set at 2400 seconds (40 minutes) late and 900 seconds (15 minutes) early. These trips were removed as they would otherwise fall in the wrong time frame when analyzed against the vehicle travel times

  2. A stop has a sudden large predicted delay resulting in a much later arrival time than that of the following stop. These stops were removed

  3. A stop has an arrival time later than any following stops and the timestamp is earlier than any following stops. These stops were removed to ensure the most recent timestamp is preferred when a discrepancy occurs

  4. A stop’s arrival time is earlier than previous stops and the timestamp is older

  5. A stop’s arrival time is earlier than prior stops but they both have the same timestamp. In this case it is not possible to know which is correct. We assume the stop with the earlier stop sequence is correct since it is closer to the bus when the update is made

  6. Two consecutive stops have identical arrival times but with different timestamps. The stop with the older timestamp was removed

  7. Two consecutive stops have identical arrival times and timestamps. The stop with the higher stop sequence was removed

  8. Many stops on the same trip have the same arrival time, likely due to a retrospective entry error. These trips were removed
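As an illustration of how one of these rules might be coded, the sketch below flags the first error type using the thresholds stated for it (2400 seconds late, 900 seconds early). It is a hypothetical simplification, not the report's code: a trip is flagged when every stop exceeds the same threshold, and the function names and the input layout (a mapping from trip_id to its list of stop delays) are assumptions.

```python
def is_retrospective_trip(delays, late=2400, early=-900):
    """True if every stop on the trip is more than `late` seconds late
    or every stop is more than 900 seconds early (delay below `early`)."""
    if not delays:
        return False
    return all(d > late for d in delays) or all(d < early for d in delays)


def drop_retrospective_trips(trip_delays):
    """Remove flagged trips from a {trip_id: [delay, ...]} mapping."""
    return {
        trip_id: delays
        for trip_id, delays in trip_delays.items()
        if not is_retrospective_trip(delays)
    }


# Hypothetical delays in seconds per stop.
trip_delays = {
    "A": [30, 45, 60],         # ordinary trip
    "B": [2500, 2600, 3000],   # whole trip very late
    "C": [-1000, -950, -980],  # whole trip very early
    "D": [30, 2500, 40],       # single-stop spike, handled by a different rule
}
kept = drop_retrospective_trips(trip_delays)
```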

7 References